Topic Factor Modelling: uncovering thematic structure in financial data

Abstract

We examine the task of finding thematic structure in a data corpus comprising text and time series, motivated by applications to financial data such as improving correlation estimation. We introduce a topic factor model (TFM): a joint generative model for both text and time series data that resembles supervised latent Dirichlet allocation. Our TFM allows the decomposition of time series into factors that also reflect the thematic content of the text. The structure is found using mean field variational inference, though this is complicated by the lack of a closed-form update for some variables. The key modelling challenge is balancing the combination of continuous and discrete distributions. We use a corpus from the foreign exchange market to demonstrate improved likelihood of held-out time series data.
In finance, decomposition of time series into the key driving forces is commonplace. PCA and related methods are used to identify features of comovement of prices. Alternatively, returns can be attributed to some set of economic variables by regression. The resulting, ubiquitous, probabilistic models of returns are called factor models (see, for example, Fama and French [2]). We are not aware of any previous attempt at a middle ground: automatically detecting a time series decomposition whose components have economic meaning. Joint topic modelling of text and time series should be able to achieve just that. By extension of the above-mentioned methods, we call our model a topic factor model (or TFM). It differs from existing joint models in that the non-text variables are not sampled by first drawing a topic from the topic distribution, but rather depend directly on the topic proportions. This creates some issues around balancing discrete and continuous probability, but gives rise to a model that both corresponds better to our intuition and is able to outperform existing models.
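As a rough illustration of the PCA-style decomposition of returns mentioned above, the following sketch extracts principal-component factors from a synthetic returns matrix. It is not the TFM and not taken from the paper: the function name `pca_factors`, the synthetic data, and all parameter values are assumptions made purely for illustration.

```python
import numpy as np

def pca_factors(returns, n_factors):
    """Decompose a (T, N) returns matrix into its leading principal components."""
    centered = returns - returns.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_factors]
    loadings = eigvecs[:, order]               # (N, n_factors)
    factors = centered @ loadings              # (T, n_factors)
    explained = eigvals[order] / eigvals.sum() # share of total variance
    return loadings, factors, explained

# Synthetic data (assumption): one common driver plus small idiosyncratic noise.
rng = np.random.default_rng(0)
T, N = 500, 6
common = rng.normal(size=(T, 1))
betas = rng.uniform(0.5, 1.5, size=(1, N))
returns = common @ betas + 0.1 * rng.normal(size=(T, N))

loadings, factors, explained = pca_factors(returns, n_factors=2)
print("variance share of each factor:", explained)
```

The components recovered this way capture comovement but carry no economic label, which is the gap the text says a joint model of text and time series is meant to fill.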
The canonical example of mixing text with continuous data is supervised latent Dirichlet allocation (sLDA) [1], which models text data and adds a response variable with distribution given by a generalized linear model. Under sLDA, however, topics can only affect the response variable when they are also used to explain the associated text. For some applications one might require that the model have the flexibility to allocate mass to topics and explain the response variable without using that mass to explain text. This is particularly the case for financial data, where thematic structure omitted from text indicates information not considered by the authors of the text and may indicate a valuable competitive advantage. Another option would be to use Dirichlet multinomial regression [3], which has features upstream of the document topic distribution. Unfortunately, this does not generalize to new features, which is important in the context of financial time series because the distribution of tomorrow's returns is critical. For these reasons we use a TFM that is similar in form to sLDA but differs from it slightly.
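The contrast drawn above can be made concrete with a small sketch. In sLDA with a Gaussian response, the response mean is a linear function of the empirical frequencies of the topic assignments of the words, so a topic with no assigned word cannot contribute. The TFM-style line below instead uses the topic proportions directly; the linear form, the coefficient vector `eta`, and all numerical values are assumptions for illustration, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, n_words = 4, 10, 2

# Hypothetical model quantities (assumptions for illustration only).
topic_word = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # (K, V)
theta = np.array([0.4, 0.3, 0.2, 0.1])   # topic proportions of one document
eta = np.array([1.0, -1.0, 0.5, 2.0])    # response coefficients per topic

# Text generation as in LDA: draw a topic per word, then a word from that topic.
z = rng.choice(n_topics, size=n_words, p=theta)
words = np.array([rng.choice(vocab_size, p=topic_word[k]) for k in z])

# sLDA-style response mean: depends on the empirical topic frequencies z_bar,
# so only topics actually assigned to some word can contribute.
z_bar = np.bincount(z, minlength=n_topics) / n_words
slda_mean = z_bar @ eta

# TFM-style response mean (assumed linear form): depends on theta directly,
# so every topic with positive mass contributes, even with no assigned word.
tfm_mean = theta @ eta

print("topics used by text:", np.flatnonzero(z_bar))
print("sLDA-style mean:", slda_mean)
print("TFM-style mean:", tfm_mean)
```

With only two words, at most two topics can enter the sLDA-style mean, while all four enter the TFM-style mean; this is the flexibility the paragraph above asks for.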

Similar resources

Interactive Visual Exploration of Topic Models using Graphs

Probabilistic topic modeling is a popular and powerful family of tools for uncovering thematic structure in large sets of unstructured text documents. While much attention has been directed towards the modeling algorithms and their various extensions, comparatively few studies have concerned how to present or visualize topic models in meaningful ways. In this paper, we present a novel design th...

Designing Capital Market Financial Instrument for Energy Efficiency Projects in Residential Buildings Using Thematic Analysis

One of the barriers to implementing energy efficiency projects in residential buildings is limited financing. Financing such projects requires a more complex financial structure due to their multilateral nature and can be done through the capital market. In this study we introduce some financial instruments for financing energy efficiency projects in the context of the Iranian capital market by them...

The Factor Structure of a Written English Proficiency Test: A Structural Equation Modeling Approach

The present study examined the factor structure of the University of Tehran English Proficiency Test (UTEPT) that aims to examine test takers’ knowledge of grammar, vocabulary, and reading comprehension. A Structural Equation Modelling (SEM) approach was used to analyse the responses of participants (N = 850) to a 2010 version of the test. A higher-order model was postulated to test if the unde...

Drawing Co-Citation Networks of Corona Virus Studies

Background and Aim: The purpose of the present study is to map the coronavirus domain citation network to better understand this domain based on all other citation networks. Materials and Methods: The present study is applied in terms of purpose, and is descriptive scientometrics in terms of type, which has been done with the all-citation method. In this study, all scientific publications on ...

Kernel Topic Models

Latent Dirichlet Allocation models discrete data as a mixture of discrete distributions, using Dirichlet beliefs over the mixture weights. We study a variation of this concept, in which the documents’ mixture weight beliefs are replaced with squashed Gaussian distributions. This allows documents to be associated with elements of a Hilbert space, admitting kernel topic models (KTM), modelling te...


Publication date: 2013